Midterm

Author

Chenjia Kuang

Introduction

This dataset contains information on 13,932 single-family homes sold in Miami.

Question

Main question: What factor will influence the sale price?

Methods

library('R.utils')
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.2 (2022-11-11 22:00:03 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library('tidyverse')
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks R.utils::extract()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('ggplot2')
library('data.table')

Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose
library('dtplyr')
library('dplyr')
library('scales')

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Read the dataset

I download data in csv form from The KAGGLE https://www.kaggle.com/datasets/deepcontractor/miami-housing-dataset

setwd("/Users/chenjiak/Downloads")
miami <- data.table::fread("miami-housing.csv")

Look at the data (EDA)

dim(miami)
[1] 13932    17
names(miami)
 [1] "LATITUDE"          "LONGITUDE"         "PARCELNO"         
 [4] "SALE_PRC"          "LND_SQFOOT"        "TOT_LVG_AREA"     
 [7] "SPEC_FEAT_VAL"     "RAIL_DIST"         "OCEAN_DIST"       
[10] "WATER_DIST"        "CNTR_DIST"         "SUBCNTR_DI"       
[13] "HWY_DIST"          "age"               "avno60plus"       
[16] "month_sold"        "structure_quality"
typeof(miami)
[1] "list"
#Rename the columns
miami <- miami %>%
  rename("lon" = "LONGITUDE",
         "lat" = "LATITUDE",
         "floor_area" = "TOT_LVG_AREA",
         "land_area"  = "LND_SQFOOT")
#Select the columns
miami <- miami[, .(lat, lon, SALE_PRC, land_area, floor_area, OCEAN_DIST, RAIL_DIST, CNTR_DIST, age, avno60plus, structure_quality, month_sold)]
#Take a closer look at the key variables.
table(miami$month_sold)

   1    2    3    4    5    6    7    8    9   10   11   12 
 833  929 1224 1234 1245 1387 1210 1275 1218 1057 1159 1161 
table(miami$structure_quality)

   1    2    3    4    5 
 179 4110   16 7625 2002 
summary(miami$SALE_PRC)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  72000  235000  310000  399942  428000 2650000 
summary(miami$land_area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1248    5400    7500    8621    9126   57064 
summary(miami$floor_area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    854    1470    1878    2058    2471    6287 
summary(miami$RAIL_DIST)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.5  3299.4  7106.3  8348.5 12102.6 29621.5 
summary(miami$OCEAN_DIST)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  236.1 18079.3 28541.8 31691.0 44310.7 75744.9 
summary(miami$CNTR_DIST)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3826   42823   65852   68490   89358  159976 

We checked the dimension of our data and noticed that there are 13932 total observations and 17 different factors for each of our observation. I selected some variables that may effect the house price for further analysis. Then, I did some summaries for the key variables such as the distance to ocean, rail road, central business district, and the area of house land area and floor area. I found the lowest house price is $72000 and the highest price is $2650000. The smallest land area is 1248 square feet and the largest is 57064 square feet. For the floor area, the smallest is 854 square feet and the largest is 6287 square feet.

Variable creation

#Creatint a column about the quality for houses structure

miami <- miami %>%
  group_by(structure_quality) %>%
  mutate(structure.quality = case_when(
    structure_quality == 1 ~ "Fair",
    structure_quality == 2 ~ "Average",
    structure_quality == 3 ~ "Good",
    structure_quality == 4 ~ "Excellent Quality",
    structure_quality == 5 ~ "Superior Quality")) %>% collect()

table(miami$structure.quality)

          Average Excellent Quality              Fair              Good 
             4110              7625               179                16 
 Superior Quality 
             2002 
#Creatint a column about the quality for airport noise level

miami <- miami %>%
  group_by(avno60plus) %>%
  mutate(avno60_plus = case_when(
    avno60plus == 0 ~ "Accpet",
    avno60plus == 1 ~ "Exceed")) %>% collect()

table(miami$avno60_plus)

Accpet Exceed 
 13724    208 
knitr::kable(head(miami))
lat lon SALE_PRC land_area floor_area OCEAN_DIST RAIL_DIST CNTR_DIST age avno60plus structure_quality month_sold structure.quality avno60_plus
25.89103 -80.16056 440000 9375 1753 12811.4 2815.9 42815.3 67 0 4 8 Excellent Quality Accpet
25.89132 -80.15397 349000 9375 1715 10648.4 4359.1 43504.9 63 0 4 9 Excellent Quality Accpet
25.89133 -80.15374 800000 9375 2276 10574.1 4412.9 43530.4 61 0 4 2 Excellent Quality Accpet
25.89176 -80.15266 988000 12450 2058 10156.5 4585.0 43797.5 63 0 4 9 Excellent Quality Accpet
25.89182 -80.15464 755000 12800 1684 10836.8 4063.4 43599.7 42 0 4 7 Excellent Quality Accpet
25.89206 -80.16135 630000 9900 1531 13017.0 2391.4 43135.1 41 0 4 2 Excellent Quality Accpet

I created two variable about the house structure quality and airport noise level for further data analysis.

Preliminary Results

#Creating price per square foot features
miami$price_per_land_area <- miami$SALE_PRC / miami$land_area
miami$price_per_floor_area <- miami$SALE_PRC / miami$floor_area

summary(miami$price_per_land_area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.084  30.271  43.869  52.885  63.333 396.040 
summary(miami$price_per_floor_area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  44.17  133.44  163.38  188.87  209.82 1345.03 
#Creating relation graph for housing saling price and ocean distance which pretty make sense since miami is a  travel city

ggplot(miami, aes(x = OCEAN_DIST, y = SALE_PRC)) + 
  geom_point(aes(color = SALE_PRC), alpha = 0.6) + 
  theme_minimal() +
  labs(title = "Correlation between ocsan distance and the sale price",
       x = "Ocean distance", y = "Sale price") +
  scale_color_gradient(low = "blue", high = "red") + 
  stat_smooth(method = lm) +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

#Creating relation graph for housing saling price for land area per square feet and ocean distance

ggplot(miami, aes(x = OCEAN_DIST, y = price_per_land_area)) + 
  geom_point(aes(color = price_per_land_area), alpha = 0.6) + 
  theme_minimal() +
  labs(title = "Correlation between ocsan distance and the price per land area",
       x = "Ocean distance", y = "Price per land area") +
  scale_color_gradient(low = "blue", high = "red") + 
  stat_smooth(method = lm) +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

#Creating relation graph for housing saling price for floor area per square feet and ocean distance

ggplot(miami, aes(x = OCEAN_DIST, y = price_per_floor_area)) + 
  geom_point(aes(color = price_per_floor_area), alpha = 0.6) + 
  theme_minimal() +
  labs(title = "Correlation between ocsan distance and the price per floor area",
       x = "Ocean distance", y = "Price per floor area") +
  scale_color_gradient(low = "blue", high = "red") + 
  stat_smooth(method = lm) +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

#Creating relation graph for housing saling price and distance to the nearest rail line

ggplot(miami, aes(x = RAIL_DIST, y = SALE_PRC)) +
  geom_point(aes(color = SALE_PRC), alpha = 0.6) +
  theme_minimal() +
  scale_color_gradient(low = "blue", high = "red") +
  stat_smooth(method = lm) +
  labs(title = "Correlation between the distance to the nearest rail line and the sale price",
       x = "Rail distance", y = "Sale price") +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

#Creating relation graph for housing saling price and the distance to the Miami central business district

ggplot(miami, aes(x = CNTR_DIST, y = SALE_PRC)) +
  geom_point(aes(color = SALE_PRC), alpha = 0.6) +
  theme_minimal() +
  scale_color_gradient(low = "blue", high = "red") +
  stat_smooth(method = lm) +
  labs(title = "Correlation between the distance to the nearest Miami central business district and the sale price",
       x = "Central business district distance", y = "Sale price") +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'

#Creating a histogram graph to find out the correlation between house price and the structure quality

miami %>%
  filter(!(structure.quality %in% NA)) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = SALE_PRC, fill = structure.quality)) +
  labs(title = "Sale price by struvture quality", x = "Sale price", y = "Count") +
  scale_fill_brewer(palette = "Spectral")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Creating a histogram graph to find out the correlation between house price and the airplane noise level

miami %>%
  filter(!(avno60_plus %in% NA)) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = SALE_PRC, fill = avno60_plus)) +
  labs(title = "Sale price by airplane noise level", x = "Sale price", y = "Count") +
  scale_fill_brewer(palette = "YlGnBu")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Conclusion

According to the Kaggle’s Miami House Dataset, home values are influenced by the distance to the ocean, railroads, central business district, and airport noise. Coastal properties, offering better views and convenience, usually have higher values. The least impactful factor is railroads, while CBD proximity greatly affects values, given its role as a city’s economic hub. Moreover, floor area impacts price more than land area. This may be because the usable area of a building is more important than the land area. This phenomenon is common in high-density or land-scarce areas. Interestingly, the structure quality doesn’t significantly affect prices, suggesting buyers prioritize location and size over construction quality.